
Journal of Biomedical Informatics

37 training papers 2019-06-25 – 2026-03-07

Top medRxiv preprints most likely to be published in this journal, ranked by match strength.

1
GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records
2022-02-28 health informatics 10.1101/2022.02.27.22271257
#1 (23.7%)

Objective: To develop a large pretrained clinical language model from scratch using transformer architecture; systematically examine how transformer models of different sizes could help 5 clinical natural language processing (NLP) tasks at different linguistic levels. Methods: We created a large corpus with >90 billion words from clinical narratives (>82 billion words), scientific literature (6 billion words), and general English text (2.5 billion words). We developed GatorTron models from scratch ...

2
Scaling text de-identification using locally augmented ensembles
2024-06-20 health informatics 10.1101/2024.06.20.24308896
#1 (23.4%)

The natural language text in electronic health records (EHRs), such as clinical notes, often contains information that is not captured elsewhere (e.g., degree of disease progression and responsiveness to treatment) and, thus, is invaluable for downstream clinical analysis. However, to make such data available for broader research purposes, in the United States, personally identifiable information (PII) is typically removed from the EHR in accordance with the Privacy Rule of the Health Insurance ...

3
Predicting Behavioral Determinants of Health from Clinical Text Using Transformer Models and BiLSTM
2025-10-15 health informatics 10.1101/2025.10.13.25337944
#1 (23.4%)

Objective: Social and behavioral determinants of health play a critical role in patient outcomes, yet much of this information is documented only in unstructured clinical text rather than structured records. Detecting these factors using natural language processing enables a deeper understanding of patient health and supports better decision-making. This study aims to improve the prediction of Behavioral Determinants of Health (BDoH) from medical records by systematically comparing multiple transf...

4
Keyphrase Identification Using Minimal Labeled Data with Hierarchical Context and Transfer Learning
2023-01-26 health informatics 10.1101/2023.01.26.23285060
#1 (23.3%)

Background: Interoperable clinical decision support system (CDSS) rules provide a pathway to interoperability, a well-recognized challenge in health information technology. Building an ontology facilitates creating interoperable CDSS rules, which can be achieved by identifying the keyphrases (KP) from the existing literature. Ontology construction is traditionally a manual effort by human domain experts, and the newly advanced natural language processing techniques, such as KP identification, can ...

5
Advancements in Multilingual Biomedical Natural Language Processing: exploring Large Language Models for Named Entity Recognition and Linking
2026-01-23 health informatics 10.64898/2026.01.22.26344605
#1 (23.3%)

Objective: Named Entity Recognition (NER) and Biomedical Entity Linking (BEL) are essential for transforming unstructured Electronic Health Records (EHRs) into structured information. However, tools for these tasks are limited in non-English biomedical texts such as Dutch and Italian. This study investigates the use of prompt-based learning with Large Language Models (LLMs) to perform multilingual NER and BEL using minimal domain-specific data, while addressing annotation preservation during corpus...

6
Overview of the 8th Social Media Mining for Health Applications (#SMM4H) Shared Tasks at the AMIA 2023 Annual Symposium
2023-11-08 health informatics 10.1101/2023.11.06.23298168
#1 (23.2%)

The aim of the Social Media Mining for Health Applications (#SMM4H) shared tasks is to take a community-driven approach to address the natural language processing and machine learning challenges inherent to utilizing social media data for health informatics. The eighth iteration of the #SMM4H shared tasks was hosted at the AMIA 2023 Annual Symposium and consisted of five tasks that represented various social media platforms (Twitter and Reddit), languages (English and Spanish), methods (binary c...

7
Hazard-aware adaptations bridge the generalization gap in large language models: a nationwide study
2025-02-17 health informatics 10.1101/2025.02.14.25322312
#1 (23.1%)

Despite growing excitement in deploying large language models (LLMs) for healthcare, most machine learning studies show success on the same few limited public data sources. It is unclear if and how most results generalize to real-world clinical settings. To measure this gap and shorten it, we analyzed protected notes from over 100 Veterans Affairs (VA) sites, focusing on extracting smoking history--a persistent and clinically impactful problem in natural language processing (NLP). Here we applie...

8
Can Large Language Models Reduce the Cost of Extracting Data from Electronic Health Records for Research?
2026-01-11 health informatics 10.64898/2026.01.09.26343792
#1 (23.0%)

Objective: Much medical data is only available in unstructured electronic health records (EHRs). These data can be obtained through manual (human) extraction or programmatic natural language processing (NLP) methods. We estimate that NLP only becomes economically competitive with manual extraction when there are ~6500 EHR records. We have found that there is interest from clinicians and researchers in using NLP on projects with fewer records. We examine whether a large language model (LLM) can be ...

9
Development and validation of MedDRA Tagger: a tool for extraction and structuring medical information from clinical notes
2022-12-14 health informatics 10.1101/2022.12.14.22283470
#1 (22.9%)

Rapid and automated extraction of clinical information from patient notes is a desirable though difficult task. Natural language processing (NLP) and machine learning have great potential to automate and accelerate such applications, but developing such models can require a large amount of labeled clinical text, which can be a slow and laborious process. To address this gap, we propose the MedDRA tagger, a fast annotation tool that makes use of industrial-level libraries such as spaCy, biomedic...

10
Large Language Models Struggle to Encode Medical Concepts - A Multilingual Benchmarking and Comparative Analysis
2025-01-15 health informatics 10.1101/2025.01.15.25320579
#1 (22.7%)

Interoperability in health information systems is crucial for accurate data exchange across environments such as electronic health records, clinical notes, and medical research. The main challenge arises from the wide variation in biomedical concepts, their representation across different systems and languages, and the limited context, complicating data integration and standardization. Inspired by recent advances in large language models (LLMs), this study explores their potential role as biomed...

11
PGxRAG: A Retrieval Augmented Generation supported Pharmacogenomics Assistant
2025-09-25 health informatics 10.1101/2025.09.24.25336524
#1 (22.7%)

Pharmacogenomics enables personalized medicine by predicting individual drug responses based on genetic makeup, but complex guideline retrieval remains challenging for clinicians, particularly in resource-limited settings. While large language models (LLMs) show promise for numerous healthcare applications, their performance on domain-specific pharmacogenomics queries without expert knowledge integration remains limited. We evaluated whether Retrieval-Augmented Generation (RAG) enhancement impro...

12
Improving Model Transferability for Clinical Note Section Classification Models Using Continued Pretraining
2023-04-24 health informatics 10.1101/2023.04.15.23288628
#1 (22.6%)

Objective: The classification of clinical note sections is a critical step before doing more fine-grained natural language processing tasks such as social determinants of health extraction and temporal information extraction. Often, clinical note section classification models that achieve high accuracy for one institution experience a large drop of accuracy when transferred to another institution. The objective of this study is to develop methods that classify clinical note sections under the SOAP...

13
Detecting Medication Mentions in Social Media Data Using Large Language Models
2025-05-18 health informatics 10.1101/2025.05.16.25327791
#1 (22.5%)

The automatic extraction of medication mentions from social media data is critical for pharmacovigilance and public health monitoring. In this study, we present an end-to-end generative approach based on instruction-tuned large language models (LLMs) for medication mention extraction from Twitter. Reformulating the task as a text-to-text generation problem, our models achieve state-of-the-art results on both fine-grained span extraction and coarse-grained tweet-level classification, surpassing t...

14
Natural Language Processing for Clinical Laboratory Data Repository Systems: Implementation and Evaluation for Respiratory Viruses
2022-11-29 health informatics 10.1101/2022.11.28.22282767
#1 (22.5%)

Background: With the growing volume and complexity of laboratory repositories, it has become tedious to parse unstructured data into structured and tabulated formats for secondary uses such as decision support, quality assurance, and outcome analysis. However, advances in Natural Language Processing (NLP) approaches have enabled efficient and automated extraction of clinically meaningful medical concepts from unstructured reports. Objective: In this study, we aimed to determine the feasibility of u...

15
Fine-tuning large language models for effective nutrition support in residential aged care: a domain expertise approach
2024-07-21 health informatics 10.1101/2024.07.21.24310775
#1 (22.5%)

Purpose: Malnutrition is a serious health concern, particularly among the older people living in residential aged care facilities. An automated and efficient method is required to identify the individuals afflicted with malnutrition in this setting. The recent advancements in transformer-based large language models (LLMs) equipped with sophisticated context-aware embeddings, such as RoBERTa, have significantly improved machine learning performance, particularly in predictive modelling. Enhancing t...

16
A Novel Sentence Transformer-based Natural Language Processing Approach for Schema Mapping of Electronic Health Records to the OMOP Common Data Model
2024-03-24 health informatics 10.1101/2024.03.21.24304616
#1 (22.5%)

Mapping electronic health records (EHR) data to common data models (CDMs) enables the standardization of clinical records, enhancing interoperability and enabling large-scale, multi-centered clinical investigations. Using 2 large publicly available datasets, we developed transformer-based natural language processing models to map medication-related concepts from the EHR at a large and diverse healthcare system to standard concepts in OMOP CDM. We validated the model outputs against standard conc...

17
Evaluating machine learning approaches for multi-label classification of unstructured electronic health records with a generative large language model
2024-06-27 health informatics 10.1101/2024.06.24.24309441
#1 (22.5%)

Multi-label classification of unstructured electronic health records (EHR) is challenging due to the semantic complexity of textual data. Identifying the most effective machine learning method for EHR classification is useful in real-world clinical settings. Advances in natural language processing (NLP) using large language models (LLMs) offer promising solutions. Therefore, this experimental research aims to test the effects of zero-shot and few-shot learning prompting, with and without paramet...

18
A Scalable Framework for Benchmarking Embedding Models for Semantic Medical Tasks
2024-08-20 health informatics 10.1101/2024.08.14.24312010
#1 (22.4%)

Text embeddings convert textual information into numerical representations, enabling machines to perform semantic tasks like information retrieval. Despite its potential, the application of text embeddings in healthcare is underexplored in part due to a lack of benchmarking studies using biomedical data. This study provides a flexible framework for benchmarking embedding models to identify those most effective for healthcare-related semantic tasks. We selected thirty embedding models from the mu...

19
CONORM: Context-Aware Entity Normalization for Adverse Drug Event Detection
2023-09-26 health informatics 10.1101/2023.09.26.23296150
#1 (22.4%)

Adverse drug events (ADEs) are a critical aspect of patient safety and pharmacovigilance, with significant implications for patient outcomes and public health monitoring. The increasing availability of electronic health records, social media, and online patient forums provides valuable yet challenging unstructured data sources for ADE surveillance. To address these challenges, we introduce CONORM, a novel framework integrating named entity recognition (NER) and entity normalization (EN) for ADE ...

20
Developing a natural language processing system using transformer-based models for adverse drug event detection in electronic health records
2024-07-10 health informatics 10.1101/2024.07.09.24310100
#1 (22.3%)

Objective: To develop a transformer-based natural language processing (NLP) system for detecting adverse drug events (ADEs) from clinical notes in electronic health records (EHRs). Materials and Methods: We fine-tuned BERT Short-Formers and Clinical-Longformer using the processed dataset of the 2018 National NLP Clinical Challenges (n2c2) shared task Track 2. We investigated two data processing methods, window-based and split-based approaches, to find an optimal processing method. We evaluated the ...